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Large-Scale Gene Expression Data Analysis: A New 
Challenge to Computational Biologists 

Michael Q. Zhang 

Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 1 1 724 USA 

The use of high-density DNA arrays to monitor gene expression at a genome-wide scale constitutes a 
fundamental advance in biology. In particular, the expression pattern of all genes in Saccharomyces cerevisiae can 
be interrogated using microarray analysis where cDNAs are hybridized to an array of each of the -6000 genes 
in the yeast genome. In this survey I review three recent experiments related to transcriptional regulation and 
discuss the great challenge for computational biologists trying to extract functional information from such 
large-scale gene expression data. 



Ever since the theory of genetic regulation of protein 
synthesis was first worked out almost 40 years ago (Ja- 
cob and Monod 1961), biologists have been fascinated 
by how different genetic programs hard coded in the 
DNA are involved in the control and regulation of gene 
expression. This is important because different tempo- 
ral-spatial gene expression patterns relate directly to 
developmental control, morphogenesis and cell differ- 
entiation, tissue specificity, hormonal communica- 
tion, or cellular stress responses. Gene expression is 
largely controlled at the transcriptional level, and tran- 
scriptional regulatory elements are located primarily in 
the upstream promoter region of each gene; however, 
the lack of quality upstream experimental data has 
made systematic global investigations very difficult 
(Zhang 1998a; for review, see Fickett and Hatzigeor- 
giou 1997). In the past, computational genomics has 
focused mainly on gene finding (Claverie 1997; Zhang 

1997) , namely finding the protein coding region and 
extracting functional information about the protein 
product. To get equally important functional informa- 
tion about control elements of a gene, one has to ana- 
lyze functional motifs, most of which occur in non- 
coding regions (Lavorgna et al. 1998). There have been 
some excellent works on genomic identification of in- 
dividual eukaryotic transcriptional factor (TF) binding 
sites (e.g., Fondrat and Kalogeropoulos 1996; Tronche 
et al. 1997; Wasserman and Fickett 1998). The most 
tedious part of these approaches is collecting enough 
experimentally verified as-acting elements that are 
shared by a set of coregulated genes. 

Since the complete yeast genome has become 
available, there have been several attempts to compu- 
tationally identifying upstream activation sequences 
(UASs) by their clustering properties (Wagner 1997, 

1998) . This approach suffers from the simple random 
background (i.e., all oligonucleotides are Poisson dis- 
tributed) assumption and will certainly miss many TF 
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sites that do not cluster or do not cluster in a simple 
fashion. Genome-wide monitoring of gene expression 
has provided a far more effective way of systematically 
studying coregulated genes and TF sites (for the latest 
reviews, see 'The Chipping Forecast" 1999). 1 focus on 
the computational problems of identifying coregulated 
genes and their promoter elements and refer people to 
read elsewhere on other interesting and equally chal- 
lenging issues, such as data requisition/process (Mar- 
ton et al. 1998), data presentation/management (Er- 
molaeva et al. 1998), and genetic network dynamics 
(Altman et al. 1998, 1999). 

The power of DNA microarray hybridization on a 
genome scale was demonstrated early on (Schena et al. 
1995; Lockhart et al. 1996; for introduction on the ba- 
sic concept and protocols, see Stein 1998). For ex- 
ample, in a yeast diauxic shift experiment, five groups 
of distinct temporal patterns of induction or repression 
could be recognized visually, as glucose concentration 
was increased (Fig. 1). The characterized members of 
each of these groups shared important similarities in 
their functions, and common regulatory mechanisms 
could be inferred for most sets of genes with similar 
expression profiles. When searching for known UAS 
motifs in each group, many coregulated genes did 
share common TF sites (DeRisi et al. 1997). 

In preparation for regulatory sequence analysis 
from such expression data, a statistical method was 
developed (van Helden et al. 1998) based on detection 
of over-represented oligonucleotides in a target set 
of upstream sequences over all noncoding sequences 
from the genome. It was applied to 10 families con- 
taining from 5 to 38 genes; 2 of the families were ac- 
tually built from the DNA microarray expression data 
of YAP1 overexpression and TUP1 deletion (DeRisi 
et al. 1997). This method was very useful for iden- 
tifying short core motifs, which is equivalent to 
the oligonucleotide relative information method 
(Zhang 1998b) and other methods used in the follow- 
ing experiments. 
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Figure 1 Distinct temporal patterns of induction or repression help to group genes that share 
regulatory properties. Shown are five examples (8-D) of some coregulated genes as glucose 
concentration (red) dropped during the exponential growing phase (A), (Taken from DeRisi et 
al. 1997). 



I now describe real data analyses in the following 
three sets of recent whole-genome expression experi- 
ments. The first was two-point comparisons using oli- 
gonucleotide chips, which detected relative mRNA lev- 
els before and after nutrient change, heat shock, or 
mating-type switch (Roth et al. 1998). The second was 
multipoint (time-course) comparisons also using oligo- 
nucleotide chips, which detected mRNA level changes 
(at different time points after cell cycle release) relative 
to an arbitrary (but fixed) standard (Cho et al. 1998) for 
the purpose of identifying cell cycle-regulated genes. 
The third consisted of both two-point and multipoint 
comparisons, but using cDNA microarrays, which de- 
tected relative mRNA levels (at different time points 
after cell cycle release) in the synchronous cells relative 
to the control of asynchronous cells (Spellman et al. 
1998) coupled with separate experiments of CLN3 and 



CLB2 induction. In Table 1, the 
main features in these experi- 
ments are listed for easy com- 
parison. 

In a two-point experiment 
(such as in Roth et al. 1998 and 
in part of Spellman et al. 1998), 
one measures the relative ratio 
of mRNA concentration under 
two different conditions for all 
genes. After sorting these ratios 
(one ratio per gene) of mRNA 
levels, one can identify the most 
induced or most inhibited genes 
from the two extreme ends of 
the sorted list. Some criteria are 
needed for the gene selection. 
For example, in the promoter 
analysis of Roth et al. (1998), 
the upstream sequences (rela- 
tive to ATG) of the 10 ORFs were 
taken from the top and the bot- 
tom of the sorted list. Hence a 
highly induced set, a highly in- 
hibited set, and a combined set 
were used in searching by Align- 
ACE for common IMS motifs. 
The rationale for examining the 
combined set is that a single 
regulatory motif may serve as ei- 
ther a positive or a negative ele- 
ment depending on its se- 
quence context or environ- 
ment. AlignACE is a modified 
version of the Gibbs sampler 
(Neuwald et al. 1995) and was 
optimized for finding multiple 
motifs (via an iterative masking 
procedure) and for aligning 
DNA sequences on both strands. It also scores align- 
ments by frequency of occurrence in the intergenic 
regions of a given genome (for the algorithmic details, 
see Roth et al. 1998). To suppress false positives, a mo- 
tif must pass two criteria: (1) exceeding an alignment 
threshold, and (2) having an occurrence score below 
1% (i.e., <1% of genes in the genome may have this 
motif). 

For the galactose versus glucose comparison, UAS G 
motif t(T/c)CGG(C/A)(G/c)NNcT(g/c)(T/c)NNcCGG, 
which is known to regulate galactose utilization genes 
via the Gal4p/Gal80p activation complex (Lohr et al. 
1995) was found successfully, but other expected UASs 
such as the Rapl site, the Gcrl site, or the Migl site were 
not found. For the 39°C vs. 30°C comparison, the heat 
shock element (HSE) and stress response element 
(STRE) were not found. Because heat shock is known to 
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Table 1. Basic Features in Three Genorrie-W|de^ 



References 



Roth et a\!($0&j 



^Cho et al. (1998) . ! 



Spellman et al. (1998) 



Goals 



Technology 

Experiments and 
synchronization 
methods 



Cluster methods 



Putative new/total genes 
identified 

Motif tools 
Database search 

Upstream of ATC 



Gal-responsive, heat shock- ani - 
mating type-specific genes and* 
c/s-acting elements 

Affymetrix oligo chips 

two-point comparisons: 

(1) galactose vs. glucose; 

(2) 39°C vs. 30°C; ;^ 

(3) type a vs. type a 



simple sorting 



(1)3/9, (2) 6/33, (3) 7/40 




AlignACE 
TRANSFAG 

<600 bp 



>GeJl^eycle-regulated genesjand * 
r l Vc/s/acting'elements ■': , ■ 



Affy m etr i x o I ig p. „c h i p s . ■ ! ; 

multipoint comparisons:; v f 

(1 ) cdc28-l*M mutant 

(2) cdc75-2 ts mutant j 



visual identification of 
periodic peaks 



-300/416 

oligonucleotide bias 
visual extension 
TRANSFAC 
500 bp 



cDN A microarrays 

two- point comparisons: 

(1) c/r?J + vs. cln3~; 

(2) c/W + vs:c/62-; 

Multipoint comparisons: 

(1) a-f actor arrest; 

(2) elutriation; 

(3) cdc15-2 ts mutant 

simple sorting for the two-point 
data; Fourier transform and 
Pearson-type correlation for 
the multipoint 

-700/800 



oligonucleotide information 
GibbsDNA 
TRANSFAC+SCPD 
700 bp 



a (ts) Temperature sensitive. 



have broad effects, including transient cell cycle arrest 
in Gt (Rowley et al. 1993), the cell cycle activation 
(CCA) motif GCGAA(a/g)ttNT(g/c)(a/g)GAA(C/g) of 
histones was found, but other cell cycle UASs, such as 
the negative regulatory element of histones (NEG), 
Swi4/6 cell cycle box (SCB), Mini cell cycle box (MCB), 
and early cell cycle box (ECB) (see below), were not 
found. For mating-type a versus a comparison, all four 
mating type-specific elements (a2 operator, tCAAtgN- 
cAg; P box, TtCCTA ATT(a/g)GgN(c/a) (a/t); pheromone 
response element (PRE), aTGAAAC; and Q box, tCAAt- 
gNcAg) were found. Some putative motifs were also 
found, but many were suspected to be false positives. 
As only one time point (or averaged stationary time 
points) is taken for each pair of conditions in a two* 
point comparison experiment, dynamic information is 
totally lost. It is impossible to separate the primary 
transcriptional event from the downstream cascades. It 
would have been much more informative for detecting 
coregulated genes if measurements had been taken at 
multiple consecutive time points. (In principle, curve- 
fitting may result in more robust time series but cur- 
rently available points in a particular experiment were 
too limited for smoothing.) This is why I discuss the 
other two sets of time-course experiments for identify- 
ing a complete set of cell cycle-regulated genes and 
regulatory sequences in yeast. 

Figure 2 shows a current model illustrating inter- 
actions that determine cell cycle-regulated transcrip- 



tion in yeast (Koch and Nasmyth 1994; Mclnerny et al. 
1997). C/tt3-associated kinase activates late G t -specific 
transcription factors [SBF (SCB binding factor) and 
MBF (MCB binding factor)] in a cell size-dependent 
fashion. SBF and MBF mediate the expression of 
CLN1,2 and CLB5,6 as well as S-phase proteins leading 
to budding and S-phase entry. CLN1,2 activity allows 
accumulation of Clbs by an unknown mechanism. 
Clbl and Clb2 activate transcription of G 2 -specific 
genes and thereby autoactivate their own synthesis, 
possibly via transcription factors Mcml and Sff. At the 
same time, Clbl,2/cdc28 represses SBF-mediated tran- 
scription. Whereas Cibl,2/cdc28 activates expression of 
SWI5 and possibly of ACE2 RNAs via mcml/Sff, it keeps 
the gene products in an inactive state by phosphory- 
lation of the nuclear localization signals. Clb proteoly- 
sis at the end of mitosis dramatically changes the situ- 
ation: C/fc-mediated activation of G 2 -specific genes is 
stopped, and Swi5 loses its inhibitory phosphoryla- 
tions, leading to its uptake into the nucleus where it 
can activate early G } -specific transcripts. At late M 
phase, a wcwJ-related factor binds to ECB (early cell 
cycle box) and initiates M/G, -specific activation of 
CLN3, SWI4, and some DNA replication genes; these 
gene products have critical roles in promoting the ini- 
tiation of the next S phase. 

Since oscillation of histone mRNAs was discovered 
(Hereford et al. 1981), 103 cell cycle-regulated mes- 
sages have been identified using traditional methods, 
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Figure 2 A current model of transcriptional control of the yeast cell cycle. 



and it was estimated that some 250 in total might exist 
(Price et al. 1991). To create a comprehensive catalog of 
yeast genes whose transcript levels vary periodically 




Figure 3 Gene expression patterns of 800 cell cycle-regulated 
genes are sorted by the phases of the Fourier transform. At each 
time point, red/green means that synchronous cells have higher/ 
lower expression compared to asynchronous cells. Different 
phases are color coded on the top. The genes having expression 
levels peak in a specific phase are also coded by the same color 
scheme at right. 



with the cell cycle, two independent sets of genome- 
wide transcriptional experiments have been completed 
recently (Cho et al. 1998; Spellman et al. 1998). 

Cho et al. (1998) used the commercially available 
oligonucleotide arrays and temperature-sensitive cdc28 
mutant synchronization. After normalization of the 
expression profiles, cell cycle-dependent periodicity 
was found for 416 of the -6200 monitored transcripts. 
These genes were classified into five groups (early G,, 
late G lf S, G 2 , and M) according to their visual peak 
positions and the consistency with known genes. 



Table 2. Example Scores and Oscillation Amplitude for I 
a Collection of Genes 



Rank 


Score 


Gene 


Peak to trough 


1 


15.99 i 


PIR1 


27.3 


9 


10.90 


CLN2 


12.1 


37 


8.78 


CLB1 


9.4 


82 


6.51 


BUD9 


7.0 


177 


4.25 


STE3 


12.8 


224 


3.55 


TUB4 


4.8 


255 


3.29 


DUN1 


4.2 


401 


2.37 


CIN8 


5.4 


407 


2.33 


TUB2 


5.5 


585 


1.71 


MET1 


3.0 


800 


1.314 


STP4 


5.9 


844 


1.28 


SEC8 


4.2 


861 


1.25 


TUB1 


2.7 


1258 


0.92 


ANP1 


3.1 


1799 


0.71 


TUB1 


3.0 


2499 


0.54 


TUB3 


2.7 


2673 


0.50 


IME2 


3.5 


6054 


0.05 


RPS8B 


10.9 



i ; 

> See Spellman et al. (1998) for details. 
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A SCB: CACGAAA B SCB variant: CGCGAAA C MCB: ACGCGT 
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Figure 4 Comparison of some consensus distributions (counts per gene at 50-bp intervals) in the upstream regions of each phase group 
and control group. 

Through matches of known TF sites and visual exten- 
sion of over-represented short oligonucleotides; a 
dozen putative motifs (including MCB, SCB, ECB/ 
MCM1) were identified in the promoter regions (Table 
2 in Cho et al. 1998). 

In contrast, Spellman et al. (1998) used "home- 
made" cDNA microarrays and samples from yeast cul- 
tures synchronized by three independent methods: 
a-factor arrest, elutriation, and blockage of a cclclS 
temperature-sensitive mutant. Using Fourier transform 
and Pearson-type correlation algorithms, 800 genes 
were identified that met an objective minimum crite- 
rion for cell cycle regulation. In separate experiments, 
designed to examine the effects of inducing either the 
Gj cyclin Cln3p or the B-type cyclin Clb2p, it was 
found that mRNA levels of more than half of these 800 
genes responded to cyclin induction. These 800 gene 
expression patterns (sorted by the phases) are shown in 
Figure 3. The genes were divided into the following five 
groups: M/G, (113 genes), G T (300 genes), S (71 genes), 
G 2 (121 genes), and M (195 genes). Spellman et al. 
(1998) used the following clustering strategy: Five pro- 
files were built from genes known to be expressed in 
each class. The averaged peak correlation score (de- 
fined as the highest correlation value between the log 
2 (ratio) data series for each gene and each profile) of 
different experiments was used to construct an objec- 
tive "aggregate CDC (cell cycle-clependent clustering) 




rg^ Alpha j sdc 1 5 cdc28 Elu 





§MCM 



|CLB2 




Hi stones 



Figure 5 Genes that share similar expression profiles are 
grouped by correlation clustering (Eisen et al. 1998). The den- 
dragram (left) shows the structure of the cluster relationship. 
Nine clusters for promoter analysis are marked in blue. 
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Table 3. Potential UAS Motifs Found in Each Cluster 



Name 


Genes 


Group 


Motifs 


Sites > 
(% genes) 


Sites 
(% genes out of 
256 controls) 


Cho 
et al. 
(1998) 


Cln2 


58 


Gi 


MCB:ACGCGT 


52 (62) .: 


15(6) 


+ 








SGRCRCGAAA 




33 (\Vi 




Y' 


31 


Gi 


RAP1 :TGCACCW 


42(71) : 


33(12) 










7-AGCSGCT etc 




16 f"31 
1 0 v,->; 




Fksl 


38 


C, 


SCB:CRCGAAA 


26(53) 


33 (13) 


+ 








7'TKCAKCTCCA 


4/1 IV 


3 rn 




Histone 


9 


S 


CCA:GcGAArytngrGAACr 


1 9 (1 00) 


0(0) 












1 ft }kQ\ 
i o \oy/ 












SCB:CCCGAAA 


7(56) 


14(5) 


+ 


Met 


20 


S/G 2 


Cbfl/Met/Met28:TCAGGTG 


20 (60) 


17(5) 










Meti 1 /Met3z:/\AAn i *j i uu 


1 A YX. CV 


12 (5) 




Clb2 


36 


G 2 /M 


Mcml(P-box):TTWCCyaawnnGGwAA 


55 (64) 


1 (0) 


+ 








Mcm1 (P-box):+Sff:(P)n 2-4 RTaAAYAA 


19(47) 


^ 0(0) 




Mem 


38 


M/G, 


ECB:TTTCCcaATngGCAAA 


73 (79) 


1 (0) 


+ 








?:AAAGAAAA 


26 (53) 


20 (8) 




Sicl 


27 


M/G, 


SWI5:RRCCAGCR 


23 (48) 


23(9) 










?:GCSCRGC 


12(41) 


31 (11) 




Mat 


13 


M/G, 


Stel 2(PRE):TGAAACA 


10(54) 


48(18) 










P'+Q:tTTCCTaaTTrGknnnTCAATG 


8(46) 


0(0) 










?:WnAnnAGCCAnnnnWWnMAAAnA 


6(46) 


2(1) 





(?) An unknown motif. Motif site counts (percent of genes containing the motif) in each cluster. and in the control are also shown. (+ 
or -) The motif was found or not found in Cho et al. (1998). As Y' is full of repeats, there are many "motifs" that look significant on 
pure statistical ground. All sites were counted on both strands in the (- 700 y - 1) region, except MCB:ACGCCT was counted on one 
strand and histone motifs were counted only in the commonly shared promoter regions. 



score." Genes were ranked by their aggregate CDC 
scores, and the list was examined to determine a 
threshold that was exceeded by 91% of known cell 
cycle-regulated genes. Altogether, 800 genes met or ex- 
ceeded the threshold. Clustering of randomly shuffled 
data indicated that the false-positive rate should be 
<10%. Table 2 provides some examples of the kinds of 
scores obtained for several genes (including specific ex- 
amples that are and are not cell cycle regulated). 

Using a newly developed Saccharomyces cerevisiae 
promoter database and analysis tools (Zhu and Zhang 
1999), the 700-bp upstream regions of the five phase 
groups were analyzed further. 

Spellman et al. (1998) first computed the relative 
pentamer information (an olignucleotide bias mea- 
sure) of each phase group versus the control group of 
non-cell-cycle genes (Zhang 1998b). They then tried to 
extend the informative oligomers or to find other 
longer motifs by using GibbsDNA (Z. Ioschikhes and 
M.Q. Zhang, unpubl.), which is another modified ver- 
sion of the Gibbs sampler and includes features, such 
as double strand, palindrome symmetry, distance con- 
straint and submotif inclusion/exclusion. As Gibbs 
sampling is a stochastic process, a sufficient number of 
runs had to be carried out for each data set with various 
parameters, and the motifs that had higher maximum 



aposteri probability (MAP) values were selected. Once 
motifs were established for a group, their predictive 
value was tested by searching for the motif consensus 
(with specified mismatches) in the promoter regions of 
all groups, as well as for the control group. Figure 4 
shows the selectivity of some consensus motifs with 
respect to different regulatory groups and positions. 

Because the cutoffs for different phase groups were 
somewhat arbitrary, to search for better coregulated 
gene clusters, the Peason-type correlation clustering al- 
gorithm (Eisen et al. 1998) was used to identify nine 
clusters [data from Cho et al. (1998) was also included 
during clustering for completeness]. The dendragram 
of these clustered expression profiles is shown in Figure 
5. UAS motifs for each of these clusters were calculated, 
and results are shown in Table 3. As the statistics show, 
many of these motifs contain information predictive 
of cell cycle regulation (see Fig. 4). A full description 
and complete data sets are available at http://cellcycle- 
www.stanford.edu and at http://www.cshl.org/ 
mzhanglab. 

It is clear that multiple time points are more useful 
and better for clustering and promoter analysis. The 
most obvious difference between the results of Spell- 
man et al. (1998) and Cho et al. (1998) is the number 
of cell cycle-regulated genes and promoter elements 
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identified. With a manual decision process, Cho et al. 
found 421 genes to be cell cycle regulated. A set of 800 
genes found by Spellman et al. includes 304 of these, 
but the other 117 do not appear significantly cell cycle 
regulated in their experiments. The set of 800 genes 
therefore contains 496 genes not identified by Cho et 
al. The main cell cycle control promoter element SWI5 
site (among some others) was not identified in the 
analysis of Cho et al. because (according to Cho et al.) 
"Swi5 do not have a highly conserved binding se- 
quence, making it difficult to accurately search ge- 
nomic sequence for possible sites of action." Here, the 
SWI5 site was a good example for which the factor was 
well known generally, but the site had not been well 
characterized experimentally. Another advantage of 
the analysis of Spellman et al. was the diversity of ex- 
periments, which allowed them to distinguish cell 
cycle regulation from confounding patterns such as 
those caused by the heat shock response when a cul- 
ture is shifted from one temperature to another. Al- 
though transcriptional cascade could be better resolved 
by adding more time points, there are certainly tech- 
nical limits. Because differential stability could also af- 
fect the transcript level as well as transcription rate, a 
systematic detection of the turnover rate for each tran- 
script would be also crucial for more accurate global 
picture. 

In summary, with the help of genome-wide ex- 
pression techniques, it is possible to identify coregu- 
lated genes by clustering analysis. Furthermore, by 
combination of over-represented oligonucleotide 
analysis and multiple-sequence alignment programs, it 
is also possible to identify upstream regulatory motifs 
commonly shared by coregulated genes. Good cluster- 
ing is better than sophisticated motif-search algo- 
rithms. It would be highly desirable if one could com- 
bine motif and cluster analyses, as good clustering can 
facilitate motif identification, and, conversely, con- 
served motifs (or any other functional information re- 
lated to the sequences) can help to improve clustering. 
We ought to work toward a self-consistent iteration 
process (clustering coregulated genes -> detecting 
common motifs) as used in all scientific inference 
(functional groups -> conserved structures). Although 
1 have only discussed promoter motif detection in the 
context of array data analysis, it can be equally applied 
to a similar search of 3'-untranslated regions (3' UTRs). 
As the 3' UTR is often important for message stability 
and transport, identification of conserved motifs in 
this region may be also be instructive for gene regula- 
tion. 
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NOTE 

Recently, an interesting new clustering algorithm GENE- 
CLUSTER based on self-organizing maps has been developed 
(Tamayo et al. 1999); it was used to recluster the data of Cho 
et al. (1998) and found essentially similar results. Unfortu- 
nately, there was no regulatory sequence analysis. 
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